Population stratification is a problem encountered in several areas of biology and public health.
We tackle this problem by mapping a population and its elements' attributes into a hypergraph, a
natural extension of the concept of graph or network to encode associations among any number of
elements. On this hypergraph, we construct a statistical model reflecting our intuition about how the
elements' attributes can emerge from a postulated population structure. Finally, we introduce the
concept of stratification representativeness as a means to identify the simplest stratification already
containing most of the information about the population structure. We demonstrate the power of
this framework by stratifying an animal and a human population based on phenotypic and genotypic
properties, respectively.



INTRODUCTION 



A population stratification problem consists of uncovering the structure of a population of
individuals, samples or elements given a list of attributes characterizing them. For example,
the design of a zoo requires us to understand the best way to allocate different animals to
different zoo locations depending on their habitat, behavior, and other properties. The
traditional approach to this problem is based on a mapping into a network problem [1-5], where
the nodes or vertices represent the population elements, the links or edges represent pairwise
relations between the elements, and the edge weights account for the degree of similarity or
dissimilarity between the corresponding elements.

In several population stratification problems it is clear, however, that the system under
consideration is characterized by relationships involving more than two elements. For example,
the property "mammal" divides the animal population into two groups, non-mammals and mammals,
each containing several elements. Hypergraphs can be used to represent associations beyond
pairwise relations. A hypergraph is an intuitive extension of the concept of graph or network
where the edges are sets of any number of elements. For example, in an animal population, an
edge can represent an association between all animals with a given property, such as all
airborne animals.

We consider hypergraphs as a suitable mathematical structure to represent a population of
elements and their attributes. We introduce a statistical model on the population attributes
hypergraph as a means to solve the inverse problem: finding the population stratification given
the population elements and their associations according to certain attributes. We also discuss
technical issues associated with the framework and its application to real examples.

FIG. 1: Hypergraph: a) A hypergraph with three edges. Each edge is represented by a circle and
is composed of the nodes within the circle. b) Bipartite graph representation of the hypergraph
in a), with the squares representing the hypergraph edges.



HYPERGRAPH REPRESENTATION 

A hypergraph is an intuitive extension of the concept of a graph or network where the nodes
represent the system's elements and the edges (also called hyperedges) are sets of any number
of elements (Fig. 1a). This mathematical construction is very useful to represent a population
of elements and their attributes. For example, consider the animal population in Fig. 2a
together with their attributes: habitat, nutrition behavior, etc. In this case the hypergraph
nodes represent animals. Furthermore, we can use an edge to represent the association between
all animals with a given attribute: edge 1, all non-airborne animals; edge 2, all airborne
animals; and so on (Fig. 2b).
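The attribute-to-edge mapping described above can be sketched in a few lines of Python. This is only an illustration: the function names `attributes_to_hypergraph` and `incidence_matrix` and the three-animal toy table are hypothetical and are not taken from the zoo dataset. Each distinct (attribute, value) pair is assumed to define one edge.

```python
from collections import defaultdict


def attributes_to_hypergraph(attributes):
    """Map {element: {attribute: value}} to a dict of hyperedges.

    Each distinct (attribute, value) pair becomes one edge containing
    every element that exhibits that value.
    """
    edges = defaultdict(set)
    for element, attrs in attributes.items():
        for name, value in attrs.items():
            edges[(name, value)].add(element)
    return dict(edges)


def incidence_matrix(elements, edges):
    """Return (a, labels) with a[i][j] = 1 if element i belongs to edge j."""
    labels = sorted(edges)
    a = [[1 if el in edges[lab] else 0 for lab in labels] for el in elements]
    return a, labels


# Toy table with hypothetical values (illustration only).
toy = {
    "bear": {"airborne": 0, "aquatic": 0},
    "gull": {"airborne": 1, "aquatic": 1},
    "bass": {"airborne": 0, "aquatic": 1},
}
edges = attributes_to_hypergraph(toy)
a, labels = incidence_matrix(sorted(toy), edges)
```

Here `edges[("airborne", 0)]` plays the role of "edge 1, all non-airborne animals", and the rows of `a` follow the alphabetical order of the elements.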

This mapping is applicable when the attributes are given by genetic information as well. For
example, consider a human population for which we know which nucleotides (represented by the
letters A, C, G and T) are present at specific chromosomes and chromosome positions. Since
humans have two copies of each gene, we
[FIG. 2a, b: table of animals with their attributes, and list of hypergraph edges with their member animals (one edge per line).]



Group 1 antelope buffalo calf cavy deer elephant fruitbat giraffe goat gorilla hamster hare oryx pony reindeer squirrel vampire vole wallaby
Group 2 aardvark bear boar cheetah leopard lion lynx mole mongoose opossum polecat puma housecat raccoon wolf
Group 3 dolphin mink platypus porpoise seal sealion
Group 4 chicken crow dove duck flamingo gull hawk kiwi lark ostrich parakeet penguin pheasant rhea skimmer skua sparrow swan vulture wren
Group 5 bass carp catfish chub dogfish haddock herring pike piranha seahorse seasnake sole stingray tuna
Group 6 frog newt pitviper slowworm toad tortoise tuatara
Group 7 flea gnat honeybee housefly ladybird moth slug termite wasp worm
Group 8 clam crab crayfish lobster octopus scorpion seawasp starfish



FIG. 2: Stratification according to phenotypic attributes: a) A list of animals is given together with certain attributes
characterizing them. The complete dataset is available from [6]. Except for the attribute "legs", one and zero indicate possession
or not, respectively, of the corresponding attribute. The problem consists of determining the optimal stratification of the animal
population based on the provided attributes. b) Hypergraph representing the zoo data. Each line corresponds to an edge,
whose elements are specified in the right column. c) ML stratification for the case of eight groups.



have two letters for each position. A scenario could be the presence of one of the letters A or
G at a given position, resulting in the combinations AA, AG and GG. When these combinations
appear with a significant frequency in the population, the position is referred to as a single
nucleotide polymorphism (SNP). This genetic information can be mapped into a hypergraph. The
vertices in the hypergraph represent individuals and the edges now represent groups of
individuals with the same genetic information at a given position: edge 1, all individuals with
call AA for SNP 1; edge 2, all individuals with call AG for SNP 1; and so on (Fig. 3).
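A minimal sketch of the genotype-to-edge mapping is given below. The function name `genotypes_to_edges`, the SNP labels `rsX` and `rsY`, and the toy calls are hypothetical. Two choices are assumptions of this sketch rather than statements of the text: the two letters of a call are treated as unordered (AG and GA define the same edge), and a missing call (assumed to be coded as NN) contributes to no edge.

```python
def genotypes_to_edges(genotypes, missing="NN"):
    """Map {individual: {snp: call}} to hyperedges keyed by (snp, call)."""
    edges = {}
    for person, calls in genotypes.items():
        for snp, call in calls.items():
            if call == missing:
                # Assumption: no information, so the individual joins no edge.
                continue
            # Assumption: allele order is irrelevant, so AG and GA coincide.
            key = (snp, "".join(sorted(call)))
            edges.setdefault(key, set()).add(person)
    return edges


# Toy genotypes with placeholder SNP labels (illustration only).
toy = {
    1: {"rsX": "AA", "rsY": "CT"},
    2: {"rsX": "AG", "rsY": "TC"},
    3: {"rsX": "GA", "rsY": "NN"},
}
edges = genotypes_to_edges(toy)
```

In this toy case the edge `("rsX", "AG")` contains individuals 2 and 3, and individual 3 belongs to no edge for `rsY`.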

STATISTICAL MODEL 

After identifying hypergraphs as a suitable mathematical structure to represent a population
and its attributes, we focus on how to use this information to solve the inverse problem:
finding the population stratification given the population elements and their associations
according to certain attributes. Our working hypothesis is that (i) the population is divided
into groups and (ii) the elements of each group are characterized by a different combination of
attributes. The latter does not exclude the possibility that two groups share one attribute
while differing in others. These hypotheses are the basis for the following statistical model
on hypergraphs:

Data: Consider a population of n individuals and a hypergraph with m edges characterizing the
relationships among them. The hypergraph can be specified, for example, using the adjacency
matrix $a$, where $a_{ij} = 1$ if element i belongs to edge j and it is zero otherwise.

Model: The population is divided into $n_g$ groups and $g_i$, $i = 1, \ldots, n$, denotes the
group to which node i belongs. With probability $\theta_{ij}$ an element of group i belongs to
edge j.

Likelihood: The likelihood to observe the data given this model is

$$P(a \mid g, \theta) = \prod_{i=1}^{n} \prod_{j=1}^{m} \theta_{g_i j}^{a_{ij}} \left(1 - \theta_{g_i j}\right)^{1 - a_{ij}} . \qquad (1)$$

FIG. 3: Mapping genotypic information into a hypergraph: a) A population of individuals,
labeled by 1, 2, 3, ..., is given together with their genotype for specific DNA positions.
These positions have been selected because they exhibit significant variation across the human
population, are referred to as single nucleotide polymorphisms (SNPs), and are labeled using
the standard SNP notation: rsNUMBER. The letters A, C, G and T represent nucleotides, and two
letters are reported because each DNA position appears in two different chromosome copies.
b) Hypergraph representing the genotypic data. Each line corresponds to an edge, whose elements
are specified in the right column.



In essence the likelihood (1) is a mathematical representation of our intuition about the
observation of the hypergraph given a population stratification, i.e. elements of the same
group have the same probability to exhibit a certain attribute and thus to belong to the edge
representing that attribute. In the following we discuss how to determine the best choice of
model parameters $(g, \theta)$ and $n_g$.
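To make the model concrete, the following sketch evaluates the logarithm of the likelihood under the assumption that it is a product of independent Bernoulli terms, one per (element, edge) pair, with success probability given by the group of the element. The function name `log_likelihood`, the clipping constant `eps`, and the toy numbers are choices of this illustration.

```python
import math


def log_likelihood(a, g, theta, eps=1e-12):
    """Log-likelihood of the incidence matrix a given groups g and theta.

    a[i][j] is 1 if element i belongs to edge j, g[i] is the group of
    element i, and theta[r][j] is the probability that an element of
    group r belongs to edge j.
    """
    total = 0.0
    for i, row in enumerate(a):
        for j, a_ij in enumerate(row):
            # Clip probabilities so that log(0) never occurs.
            p = min(max(theta[g[i]][j], eps), 1.0 - eps)
            total += math.log(p) if a_ij else math.log(1.0 - p)
    return total


# Toy example: three elements, two edges, two groups (hypothetical values).
a = [[1, 0], [1, 0], [0, 1]]
g = [0, 0, 1]
theta = [[0.9, 0.1], [0.2, 0.8]]
ll = log_likelihood(a, g, theta)
```

For this toy input every element of group 0 contributes two factors of 0.9 and the single element of group 1 contributes two factors of 0.8.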

The likelihood (1) resembles that introduced in [4] in the context of finding communities on
graphs. Despite the similarity, and although [4] was a source of inspiration, the two are quite
different in their interpretation. A hypergraph can indeed be represented by a bipartite graph,
with one type of nodes corresponding to the hypergraph nodes and another representing the
hypergraph edges (Fig. 1b). In this work we focus, however, on clustering the original
hypergraph nodes alone. Therefore, the likelihood in (1) represents a statistical model on a
hypergraph. In contrast, a true statistical model on a bipartite graph should attempt to
cluster both types of nodes, the original hypergraph nodes and the attribute nodes. There are
other technical differences. Here we model the stratification encoded in $g$ as parameters,
while they were modeled as hidden variables in [4]. Hence, although similar in form, the
likelihood in (1) is different from that in [4].



MAXIMUM LIKELIHOOD STRATIFICATION 

The model defined above belongs to the class of finite mixture models [2]. Thus, we can obtain
the optimal stratification using techniques applicable to finite mixture models in general. In
particular, we use the well-established Expectation Maximization (EM) algorithm [7] to
determine the maximum likelihood (ML) stratification given a fixed number of groups.

ML stratification: First, we compute the expectation of the log-likelihood
$\mathcal{L} = \log P(a \mid g, \theta)$ with respect to the probability $q_{ir}$ that element
i belongs to group r, obtaining

$$E[\mathcal{L}] = \sum_{i=1}^{n} \sum_{r=1}^{n_g} \sum_{j=1}^{m} q_{ir} \left[ a_{ij} \log \theta_{rj} + (1 - a_{ij}) \log(1 - \theta_{rj}) \right] . \qquad (2)$$

Second, we compute the parameters $\theta$ that maximize (2), resulting in

$$\theta_{rj} = \frac{\sum_{i=1}^{n} q_{ir} a_{ij}}{\sum_{i=1}^{n} q_{ir}} . \qquad (3)$$
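The maximization step can be checked with a short calculation. The lines below are a sketch, assuming that the expected log-likelihood has the Bernoulli-mixture form $E[\mathcal{L}] = \sum_{i,r,j} q_{ir}[a_{ij}\log\theta_{rj} + (1-a_{ij})\log(1-\theta_{rj})]$ and that the $q_{ir}$ are held fixed while differentiating with respect to a single $\theta_{rj}$.

```latex
\frac{\partial E[\mathcal{L}]}{\partial \theta_{rj}}
  = \sum_{i=1}^{n} q_{ir}
    \left[ \frac{a_{ij}}{\theta_{rj}} - \frac{1 - a_{ij}}{1 - \theta_{rj}} \right] = 0
\;\Longrightarrow\;
(1 - \theta_{rj}) \sum_{i=1}^{n} q_{ir} a_{ij}
  = \theta_{rj} \sum_{i=1}^{n} q_{ir} (1 - a_{ij})
\;\Longrightarrow\;
\theta_{rj} = \frac{\sum_{i=1}^{n} q_{ir} a_{ij}}{\sum_{i=1}^{n} q_{ir}} .
```

In words, the estimate of $\theta_{rj}$ is the membership-weighted fraction of the elements of group r that belong to edge j.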



Finally, $q$ is estimated using

$$q_{ir} = \frac{P(a \mid g, g_i = r, \theta)}{\sum_{s=1}^{n_g} P(a \mid g, g_i = s, \theta)} . \qquad (4)$$



Starting from an initial condition, we iterate equations (3) and (4) until the change of all q
elements is smaller than a predefined precision. The EM algorithm always converges to a local
maximum of the likelihood, which may or may not coincide with the global maximum. One approach
to explore different local maxima, in case they exist, consists of generating different initial
conditions [2]. Here we explore different initial conditions by assigning to the q elements the
random initial values

(5)

where each random variable in (5) is a number drawn at random between zero and one.

Putting it all together, starting from each initial condition,
we iterate equations (3) and (4) until the change of all
q elements is smaller than 10~^. 
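The iteration described above can be sketched as follows. This is a single-start illustration under stated assumptions, not the implementation used for the results: the initial memberships are taken to be uniform random numbers normalized per element (an assumed form of the random initial values), the default tolerance `tol`, the clipping constant `eps`, and the name `em_stratify` are choices of this sketch, and a full analysis would repeat the run from many initial conditions and keep the solution with the largest likelihood.

```python
import math
import random


def em_stratify(a, n_groups, tol=1e-6, max_iter=500, seed=0, eps=1e-9):
    """Single-start EM for the Bernoulli mixture on hypergraph memberships.

    a is an n x m 0/1 matrix with a[i][j] = 1 if element i belongs to edge j.
    Returns (g, theta, q): hard group labels, edge probabilities, memberships.
    """
    rng = random.Random(seed)
    n, m = len(a), len(a[0])

    # Assumed initial condition: uniform random numbers, normalized per element.
    q = []
    for _ in range(n):
        row = [rng.random() for _ in range(n_groups)]
        total = sum(row)
        q.append([x / total for x in row])

    theta = []
    for _ in range(max_iter):
        # Maximization: weighted fraction of group r that belongs to edge j.
        theta = []
        for r in range(n_groups):
            weight = max(sum(q[i][r] for i in range(n)), eps)
            row = [sum(q[i][r] * a[i][j] for i in range(n)) / weight
                   for j in range(m)]
            # Clip away from 0 and 1 so that the logarithms stay finite.
            theta.append([min(max(t, eps), 1.0 - eps) for t in row])

        # Expectation: normalized membership of each element, in log space.
        change = 0.0
        for i in range(n):
            logp = [
                sum(math.log(theta[r][j]) if a[i][j]
                    else math.log(1.0 - theta[r][j]) for j in range(m))
                for r in range(n_groups)
            ]
            top = max(logp)
            w = [math.exp(x - top) for x in logp]
            total = sum(w)
            new_row = [x / total for x in w]
            change = max(change,
                         max(abs(x - y) for x, y in zip(new_row, q[i])))
            q[i] = new_row
        if change < tol:
            break

    # Hard assignment: the group with the largest membership.
    g = [max(range(n_groups), key=lambda r: q[i][r]) for i in range(n)]
    return g, theta, q


# Toy hypergraph: two blocks of five elements with disjoint edge memberships.
a = [[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5
g, theta, q = em_stratify(a, n_groups=2)
```

Working in log space and subtracting the largest log-probability before exponentiating avoids numerical underflow when the number of edges is large.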



BEST CHOICE OF $n_g$

A more subtle issue is to determine the optimal number of groups. The standard approach to
solve this problem is based on the Occam's razor principle: given different models describing
reality with similar accuracies, we should select the simplest. In other words, we accept an
increase in model complexity only if we obtain a significantly better description accuracy or
predictive power. We use the Akaike Information Theoretical Criterion (AIC) [8] to quantify
model complexity. According to this criterion, the complexity of a model is determined by the
number of independent parameters and the best choice of $n_g$ is the one minimizing



$$\mathrm{AIC}(n_g) = -\max \mathcal{L} + (n + m)(n_g - 1) , \qquad (6)$$

where $(n + m)(n_g - 1)$ is the number of independent parameters in our statistical model. The
first term on the right-hand side of (6) quantifies the goodness of the fit and decreases with
increasing $n_g$. The second term, on the other hand, quantifies the model complexity and
increases with increasing $n_g$. The optimal choice of $n_g$ results from the balance between
these two opposite contributions.
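As a small worked illustration, the criterion can be evaluated as the negative maximized log-likelihood plus the penalty (n + m)(n_g - 1). The function names and the maximized log-likelihood values below are hypothetical and serve only to show the trade-off between the two terms.

```python
def aic(max_log_likelihood, n, m, n_groups):
    """Negative maximized log-likelihood plus the (n + m)(n_g - 1) penalty."""
    return -max_log_likelihood + (n + m) * (n_groups - 1)


def best_n_groups_by_aic(max_log_likelihoods, n, m):
    """Pick the n_g minimizing the criterion.

    max_log_likelihoods maps each candidate n_g to its maximized
    log-likelihood.
    """
    return min(max_log_likelihoods,
               key=lambda k: aic(max_log_likelihoods[k], n, m, k))


# Hypothetical maximized log-likelihoods for n = 100 nodes and m = 10 edges.
fits = {1: -690.0, 2: -520.0, 3: -400.0, 4: -395.0}
best = best_n_groups_by_aic(fits, n=100, m=10)
```

With these numbers the criterion takes the values 690, 630, 620 and 725 for one to four groups, so three groups are selected: the small gain in fit from three to four groups does not compensate the additional 110 parameters.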

It becomes clear below that the AIC can result in too conservative estimates of $n_g$, forcing
us to consider a different approach. Instead of focusing on model complexity we ask the
question: given the ensemble of all models with different $n_g$, which is the most
representative among them? To be more precise, we need a measure to compare the degree of
similarity between two different population stratifications $S_i$ and $S_j$, corresponding to
models with i and j groups, respectively. We consider the normalized mutual information [3]

FIG. 4: Test example: The best choice of $n_g$ and the normalized mutual information I between
the predicted optimal population stratification and the original stratification as a function
of the degree K. These results are obtained by computing the optimal stratification for
$n_g = 1, \ldots, 20$ using the EM algorithm with one initial condition. The optimal $n_g$ was
obtained using the AIC (solid circles), the representativeness criterion (empty squares) and
assuming it equal to four (solid triangles). In a) and b) the case study hypergraphs have
n = 100 nodes and m = 10 edges, while in c) and d) the number of edges is doubled to m = 20.



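$$I(S_i, S_j) = \frac{-2 \sum_{k=1}^{i} \sum_{l=1}^{j} p_{kl} \log \dfrac{p_{kl}}{p^{(i)}_k p^{(j)}_l}}{\sum_{k=1}^{i} p^{(i)}_k \log p^{(i)}_k + \sum_{l=1}^{j} p^{(j)}_l \log p^{(j)}_l} , \qquad (7)$$

where

$$p^{(i)}_k = \frac{1}{n} \sum_{s=1}^{n} \delta_{g^{(i)}_s k} , \qquad (8)$$

$$p_{kl} = \frac{1}{n} \sum_{s=1}^{n} \delta_{g^{(i)}_s k} \, \delta_{g^{(j)}_s l} , \qquad (9)$$

and $g^{(i)}_s$ denotes the group of element s in the stratification $S_i$.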

The normalized mutual information equals zero when the stratification $S_i$ does not contain
any information about the stratification $S_j$, becomes one when the two stratifications are
identical, and interpolates between zero and one for intermediate scenarios.
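The behavior at the two extremes can be verified with a short sketch. It assumes the common definition of the normalized mutual information as twice the mutual information divided by the sum of the entropies of the two stratifications. The function name and the convention of returning one when both stratifications consist of a single group (where the ratio is undefined) are choices of this illustration.

```python
import math
from collections import Counter


def normalized_mutual_information(g1, g2):
    """Normalized mutual information between two labelings of the same elements."""
    n = len(g1)
    c1, c2 = Counter(g1), Counter(g2)
    c12 = Counter(zip(g1, g2))
    # Mutual information from the joint and marginal group frequencies.
    mutual = sum((c / n) * math.log(c * n / (c1[k] * c2[l]))
                 for (k, l), c in c12.items())
    h1 = -sum((c / n) * math.log(c / n) for c in c1.values())
    h2 = -sum((c / n) * math.log(c / n) for c in c2.values())
    if h1 + h2 == 0.0:
        # Assumed convention: two single-group stratifications are identical.
        return 1.0
    return 2.0 * mutual / (h1 + h2)


same = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The first pair differs only by a relabeling of the groups and gives one, while the second pair is statistically independent and gives zero.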

For each stratification $S_i$ we define the stratification representativeness

$$R(S_i) = \left\langle I(S_i, S_j) \right\rangle_j , \qquad (10)$$

the average of the normalized mutual information of all stratifications $S_j$ with respect to
a given stratification $S_i$. The larger $R(S_i)$, the better the stratification $S_i$
represents the stratification ensemble, hence the name representativeness. Furthermore, we
define the most representative stratification among an ensemble of stratifications as the
stratification maximizing R. In case more than one stratification satisfies this criterion, we
invoke the Occam's razor principle and select the one with the lowest number of groups.
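The selection rule can be sketched as follows, taking as input a precomputed matrix of normalized mutual information values between the stratifications in the ensemble. Two points are assumptions of this sketch: the average excludes the self-term (including it shifts the values but does not change the ranking), and ties are detected with a small numerical tolerance `tol`. The function names and the matrix entries are hypothetical.

```python
def representativeness(nmi):
    """Average similarity of each stratification to all the others."""
    n = len(nmi)
    return [sum(nmi[i][j] for j in range(n) if j != i) / (n - 1)
            for i in range(n)]


def most_representative(nmi, n_groups, tol=1e-12):
    """Number of groups maximizing R, with ties broken toward fewer groups."""
    r = representativeness(nmi)
    best = max(r)
    candidates = [k for k, value in zip(n_groups, r) if best - value <= tol]
    return min(candidates)


# Hypothetical similarity matrix for stratifications with 2, 3 and 4 groups.
n_groups = [2, 3, 4]
nmi = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.9],
    [0.5, 0.9, 1.0],
]
r = representativeness(nmi)
choice = most_representative(nmi, n_groups)
```

In this toy ensemble the three-group stratification is the most similar, on average, to the other two and is therefore selected.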

TEST EXAMPLES 

To test the population stratification framework introduced above we need hypergraph examples
for which the stratification is already known. The statistical model defined by (1) provides
us with a straightforward method to generate an ensemble of hypergraphs. Indeed, given $g$ and
$\theta$ we can generate realizations of the hypergraph adjacency matrix using (1). We
consider the following ensemble of hypergraphs with n nodes and m edges: (i) The population is
divided into $n_g$ groups of equal size. (ii) All nodes have the same degree K, where the
degree is the number of edges to which a node belongs. (iii) The edges to which the elements
of a given group belong are selected at random among the m edges, ensuring that every pair of
groups differs in at least one edge. Provided $m > n_g$, the latter is possible only for
$1 \le K \le m - 1$, defining our working range for K.
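One way to draw a hypergraph from this ensemble is sketched below. It assumes that n is a multiple of the number of groups and that every element of a group belongs to exactly the K edges assigned to that group, so that memberships are deterministic within a group. The function name `generate_test_hypergraph` and its parameters are hypothetical.

```python
import random


def generate_test_hypergraph(n, m, n_groups, degree, seed=0):
    """Return (a, g): incidence matrix and planted groups of a test hypergraph."""
    if not 1 <= degree <= m - 1:
        raise ValueError("degree must lie between 1 and m - 1")
    if n % n_groups != 0:
        raise ValueError("n must be a multiple of n_groups")
    rng = random.Random(seed)

    # Draw distinct edge sets of size K, one per group, so that every
    # pair of groups differs in at least one edge.
    patterns = []
    while len(patterns) < n_groups:
        candidate = frozenset(rng.sample(range(m), degree))
        if candidate not in patterns:
            patterns.append(candidate)

    # Equal-size groups: consecutive blocks of n / n_groups elements.
    size = n // n_groups
    g = [i // size for i in range(n)]
    a = [[1 if j in patterns[g[i]] else 0 for j in range(m)] for i in range(n)]
    return a, g


a, g = generate_test_hypergraph(n=100, m=10, n_groups=4, degree=3)
```

Since two distinct edge sets of the same size always differ in at least one edge, rejecting repeated sets is sufficient to satisfy condition (iii).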

Using this hypergraph ensemble we generate hypergraphs with n nodes, m edges and degree K. For
each hypergraph we determine the best choice of $n_g$ and the corresponding population
stratification, using both the AIC and representativeness criteria. To compare the predicted
optimal stratification and the original subdivision of the population we use the normalized
mutual information (7) [3]. Finally, the results are averaged over 100 hypergraphs for each
set of (n, m, K).

FIG. 5: Representativeness plot: Representativeness as a function of the number of groups for
the zoo a) and MDM4 b) problems. Different symbols indicate different numerical accuracies of
the numerical algorithm used to find the ML stratification, corresponding to different numbers
of initial conditions for the EM algorithm, from 100 (open symbols) to 10,000 (filled
symbols). The arrow indicates the number of groups maximizing the representativeness.

Figure 4 shows the results for n = 100, with m = 10 and m = 20, as a function of the degree K.
When we fix a priori the number of groups to four, the stratification method based on (1)
nearly recovers the right subdivision. Indeed, the normalized mutual information between the
predicted stratification in four groups and the original subdivision is very close to one,
indicating that most nodes have been allocated to their original groups (solid triangles in
Fig. 4b and 4d). While this observation does not exclude the existence of hypergraph instances
where the method can fail, it supports its use in real cases.

Next we test the best choice of $n_g$ when it is not known a priori. For m = 10 edges the AIC
underestimates $n_g$, particularly for small K (Fig. 4a). Consequently, the normalized mutual
information between the predicted and original subdivision of the population is quite small
(Fig. 4b). This disagreement persists for m = 20 and small values of K, but is significantly
reduced for K larger than four (Fig. 4c, d). In contrast, the representativeness criterion
performs quite well for all the tested parameter combinations. On average it predicts the
right number of groups, four (Fig. 4a), and the normalized mutual information is very close to
one (Fig. 4b). Taken together, these results indicate that the representativeness criterion
performs as well as, if not better than, the AIC. Hence, in the following we restrict
ourselves to the former approach to select the best choice of $n_g$.



REAL EXAMPLES 

Now we proceed to apply the population stratification framework to real examples. The first
example is the zoo problem, requiring us to group different animals according to their
habitat, nutrition behavior, and other properties (Fig. 2a). In this case the hypergraph nodes
represent animals and each edge represents an association between animals exhibiting a given
phenotypic attribute (e.g. edge 1, all non-airborne animals; edge 2, all airborne animals;
Fig. 2b).

Figure 2c shows the animal stratification for the zoo problem for the case of eight groups. A
quick inspection shows that the elements within the same group indeed form a meaningful group.
The first three groups contain all mammals, subdivided by their habitat and feeding behavior.
The remaining groups represent birds, fishes, amphibians and reptiles, terrestrial arthropods
and aquatic arthropods (except the scorpion), in that order. A similar stratification is
obtained for the case of seven groups, except for groups 1 and 3, which are merged into one
group. On the other hand, a stratification into nine groups further splits the birds into two
groups.

Figure 5a shows the representativeness as a function of the number of groups for the zoo
problem. For a small number of groups R increases monotonically with the number of groups,
saturating to an approximate plateau at large group numbers. In the latter region, there are
small variations determined by the numerical accuracy of the algorithm computing the ML
stratification for a fixed number of groups. A model with eight groups provides the highest
degree of representativeness (Fig. 2c). Once again, a quick inspection is sufficient to
realize that this indeed represents a natural subdivision of the animal population.

The second real example concerns stratification according to genetic information. It consists
of a population of ninety Caucasians and their genotypes at different SNPs within the MDM4
gene, as reported by the HapMap project [9]. The MDM4 gene plays a key role in the p53 stress
response pathway, and genetic variations within this gene could potentially result in
different predispositions to cancer and/or responses to cancer drug therapy [10]. Focusing on
SNPs with variation among this particular subpopulation, we stratify its elements using the
method described above. Figure 5b shows the representativeness of the ML stratification as a
function of the number of groups. As for the zoo problem, the representativeness increases
monotonically for a small number of groups and saturates to a plateau with some variations
determined by the numerical accuracy. At five groups we already observe a high degree of
representativeness, and eight groups represent the best choice of $n_g$ according to the
representativeness criterion.

The genetic information for all individuals is shown in Fig. 6, stratified into five and eight
groups, the latter corresponding to the stratification with the highest representativeness.
The top and bottom groups are almost entirely homozygous (same letter) at every position. In
contrast, all the intermediate groups are heterozygous (different letters) at several
positions, and the heterozygous positions of any two of these groups differ in at least one
position. A visual inspection of both stratifications indicates that they are very similar, as
anticipated by the close values of representativeness between five and eight groups (Fig. 5b).

FIG. 6: Stratification according to genotypic attributes: A population of ninety Caucasians is studied, focusing on SNPs
within the MDM4 gene, as reported by the HapMap project [9]. SNPs with no variation within this particular subpopulation
have been excluded. A, C, G and T represent the different nucleotides and NN represents data that is not available. The
specific SNPs under consideration are indicated by the bottom labels, using the standard SNP notation. The figure shows the
ML stratification for the case of five (left) and eight (right) groups, the latter corresponding to the best choice of $n_g$ according
to the representativeness criterion (Fig. 5b). The vertical lines in between indicate the SNPs at which the adjacent groups
differ significantly.



DISCUSSION 

The mapping of either phenotypic or genetic information into a hypergraph offers significant
advantages over the current reductionist mapping of the stratification problem into a network
problem. First, the hypergraph contains all the information provided by the original data.
Second, it allows us to introduce an intuitive statistical model for the observed
phenotypic/genotypic variations based on a postulated population stratification and the
tendency of individuals within a group to exhibit a certain phenotypic/genotypic feature.
Finally, the generalization to problems dealing with both phenotypic and genotypic variation
is straightforward, after introducing a hypergraph with two edge types.

The representativeness measure introduced here can be used as an alternative to model
complexity when selecting the optimal number of groups given the available information. It is
based on the interpretation of statistical significance in terms of information content, a
philosophy with increasing recognition among the statistical modeling community [3, 11]. This
measure allows us to focus our analysis on a stratification obtained for a characteristic
number of groups, with a high information content about stratifications with a different
number of groups.

Hypergraph partitioning has already been studied with applications to numerical linear algebra
and logic circuit design [12]. The focus has been, however, on balanced clustering, which aims
at stratifications into groups of similar size. In contrast, the framework developed here is
more suitable to determine a natural partition of the population (or the hypergraph
representing it), potentially resulting in clusters of different sizes (see Fig. 2c, for
example). It is worth noting that our framework can be adapted to balanced clustering as well,
after adding to the starting statistical model the constraint that all groups have the same
size.